Abstract

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet its
slow convergence can be a computational bottleneck. Variance reduction techniques
such as SAG, SVRG and SAGA have been proposed to overcome this weakness, achieving
linear convergence. However, these methods are either based on computations of full
gradients at pivot points, or on keeping per data point corrections in memory.
Therefore speed-ups relative to SGD may need a minimal number of epochs in order to
materialize. This paper investigates algorithms that can exploit neighborhood
structure in the training data to share and re-use information about past stochastic
gradients across data points, which offers advantages in the transient optimization
phase. As a side-product we provide a unified convergence analysis for a family of
variance reduction algorithms, which we call memorization algorithms. We provide
experimental results supporting our theory.

1 Introduction 

We consider a general problem that is pervasive in machine learning, namely optimization of an
empirical or regularized convex risk function. Given a convex loss $l$ and a $\mu$-strongly convex
regularizer $\Omega$, one aims at finding a parameter vector $w$ which minimizes the (empirical)
expectation:

$$w^* = \operatorname{argmin}_w f(w), \quad f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w), \quad f_i(w) := l(w, (x_i, y_i)) + \Omega(w). \tag{1}$$

We assume throughout that each $f_i$ has $L$-Lipschitz-continuous gradients. Steepest descent can
find the minimizer $w^*$, but requires repeated computations of full gradients $f'(w)$, which becomes
prohibitive for massive data sets. Stochastic gradient descent (SGD) is a popular alternative, in
particular in the context of large-scale learning [2, 10]. SGD updates only involve $f_i'(w)$ for an
index $i$ chosen uniformly at random, providing an unbiased gradient estimate, since
$\mathbf{E} f_i'(w) = f'(w)$.

It is a surprising recent finding [11, 5, 9, 6] that the finite sum structure of $f$ allows for
significantly faster convergence in expectation. Instead of the standard $O(1/t)$ rate of SGD for
strongly-convex functions, it is possible to obtain linear convergence with geometric rates. While
SGD requires asymptotically vanishing learning rates, often chosen to be $O(1/t)$ [7], these more
recent methods introduce corrections that ensure convergence for constant learning rates.

Based on the work mentioned above, the contributions of our paper are as follows: First, we define
a family of variance reducing SGD algorithms, called memorization algorithms, which includes
SAGA and SVRG as special cases, and develop a unifying analysis technique for it. Second, we
show geometric rates for all step sizes $\gamma < \frac{1}{4L}$, including a universal
($\mu$-independent) step size choice, providing the first $\mu$-adaptive convergence proof for SVRG.
Third, based on the above analysis, we present new insights into the trade-offs between freshness
and biasedness of the corrections computed from previous stochastic gradients. Fourth, we propose
a new class of algorithms that resolves this trade-off by computing corrections based on stochastic
gradients at neighboring points. We experimentally show its benefits in the regime of learning with
a small number of epochs.

2 Memorization Algorithms 

2.1 Algorithms 

Variance Reduced SGD Given an optimization problem as in (1), we investigate a class of
stochastic gradient descent algorithms that generates an iterate sequence $w^t$ ($t \ge 0$) with
updates taking the form:

$$w^+ = w - \gamma g_i(w), \quad g_i(w) = f_i'(w) - \bar{\alpha}_i \quad\text{with}\quad \bar{\alpha}_i := \alpha_i - \bar{\alpha}, \tag{2}$$

where $\bar{\alpha} := \frac{1}{n}\sum_{j=1}^{n} \alpha_j$. Here $w$ is the current and $w^+$ the new
parameter vector, $\gamma$ is the step size, and $i$ is an index selected uniformly at random. The
$\bar{\alpha}_i$ are variance correction terms such that $\mathbf{E}\bar{\alpha}_i = 0$, which
guarantees unbiasedness $\mathbf{E} g_i(w) = f'(w)$. The aim is to define updates of asymptotically
vanishing variance, i.e. $g_i(w) \to 0$ as $w \to w^*$, which requires
$\bar{\alpha}_i \to f_i'(w^*)$. This implies that corrections need to be designed in a way to exactly
cancel out the stochasticity of $f_i'(w^*)$ at the optimum. How the memory $\alpha_j$ is updated
distinguishes the different algorithms that we consider.
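As a minimal numerical sketch of the unbiasedness property (an illustration, not part of the original analysis), the following Python snippet builds hypothetical per-example gradients and an arbitrary memory and checks that averaging the corrected direction $g_i(w)$ over all indices recovers the full gradient. The helper names `corrected_gradient` and `mean_vector` and the toy data are assumptions made for this example.

```python
import random

def corrected_gradient(grad_i, alpha_i, alpha_bar):
    """g_i(w) = f_i'(w) - (alpha_i - alpha_bar), computed coordinate-wise."""
    return [g - (a - m) for g, a, m in zip(grad_i, alpha_i, alpha_bar)]

def mean_vector(vectors):
    """Coordinate-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy data (hypothetical): n = 4 per-example gradients in d = 2 dimensions
# and an arbitrary gradient memory alpha_j.
rng = random.Random(0)
grads = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(4)]
alphas = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(4)]
alpha_bar = mean_vector(alphas)

# The expectation over a uniformly random index i is the plain average over i.
expected_g = mean_vector([corrected_gradient(g, a, alpha_bar)
                          for g, a in zip(grads, alphas)])
full_grad = mean_vector(grads)
```

The two vectors `expected_g` and `full_grad` agree up to rounding for any choice of memory, which is why the design freedom lies entirely in how the memory is updated.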

SAGA The SAGA algorithm [4] maintains variance corrections $\alpha_i$ by memorizing stochastic
gradients. The update rule is $\alpha_i^+ = f_i'(w)$ for the selected $i$, and $\alpha_j^+ = \alpha_j$
for $j \ne i$. Note that these corrections will be used the next time the same index $i$ gets sampled.
Setting $\bar{\alpha}_i := \alpha_i - \bar{\alpha}$ guarantees unbiasedness. Obviously,
$\bar{\alpha}$ can be updated incrementally. SAGA reuses the stochastic gradient $f_i'(w)$ computed
at step $t$ to update $w$ as well as $\bar{\alpha}_i$.
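The SAGA update and the incremental maintenance of $\bar{\alpha}$ can be sketched as follows. This is a toy illustration under stated assumptions: a scalar parameter, the loss $f_i(w) = \frac{1}{2}(x_i w - y_i)^2 + \frac{\mu}{2} w^2$, made-up data, and the step size $\gamma = 1/(5L)$; the function name `saga_1d` is hypothetical.

```python
import random

def saga_1d(xs, ys, mu, gamma, steps, seed=0):
    """SAGA for f_i(w) = 0.5*(x_i*w - y_i)**2 + 0.5*mu*w**2 with scalar w."""
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0
    alpha = [0.0] * n   # gradient memory, initialised to zero
    alpha_bar = 0.0     # running mean of the memory
    for _ in range(steps):
        i = rng.randrange(n)
        grad_i = (xs[i] * w - ys[i]) * xs[i] + mu * w
        # Eq. (2): w+ = w - gamma * (f_i'(w) - alpha_i + alpha_bar).
        w -= gamma * (grad_i - alpha[i] + alpha_bar)
        # Refresh the mean incrementally, then overwrite the memory slot.
        alpha_bar += (grad_i - alpha[i]) / n
        alpha[i] = grad_i
    return w, alpha, alpha_bar

# Toy problem (hypothetical data); L bounds the curvature of every f_i.
xs, ys, mu = [1.0, 2.0, -1.0, 0.5], [1.0, 3.0, 0.0, 1.0], 0.1
L = max(x * x for x in xs) + mu
w_hat, alpha, alpha_bar = saga_1d(xs, ys, mu, gamma=1 / (5 * L), steps=5000)
# Closed-form minimiser of the averaged objective, used only for comparison.
w_star = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + len(xs) * mu)
```

Each step costs one gradient evaluation and a constant-time update of the mean, independent of $n$.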

$q$-SAGA We also consider $q$-SAGA, a method that updates $q \ge 1$ randomly chosen $\alpha_j$
variables at each iteration. This is a convenient reference point to investigate the advantages of
"fresher" corrections. Note that in SAGA the corrections will be on average $n$ iterations "old".
In $q$-SAGA this can be controlled to be $n/q$ at the expense of additional gradient computations.

SVRG We reformulate a variant of SVRG [5] in our framework using a randomization argument
similar to (but simpler than) the one suggested in [6]. Fix $q > 0$ and draw in each iteration
$r \sim \mathrm{Uniform}[0; 1)$. If $r < q/n$, a complete update $\alpha_j^+ = f_j'(w)$
($\forall j$) is performed, otherwise the $\alpha_j$ are left unchanged. While $q$-SAGA updates
exactly $q$ variables in each iteration, SVRG occasionally updates all $\alpha$ variables by
triggering an additional sweep through the data. There is an option to not maintain $\alpha$
variables explicitly and to save on space by storing only $\bar{\alpha} = f'(w)$ and $w$.

Uniform Memorization Algorithms Motivated by SAGA and SVRG, we define a class of algorithms,
which we call uniform memorization algorithms.

Definition 1. A uniform $q$-memorization algorithm evolves iterates $w$ according to Eq. (2) and
selects in each iteration a random index set $J$ of memory locations to update according to

$$\alpha_j^+ := \begin{cases} f_j'(w) & \text{if } j \in J \\ \alpha_j & \text{otherwise,} \end{cases} \tag{3}$$

such that any $j$ has the same probability of $q/n$ of being updated, i.e.
$\forall j$, $\sum_{J \ni j} P\{J\} = \frac{q}{n}$ $(*)$.

Note that $q$-SAGA and the above SVRG are special cases. For $q$-SAGA: $P\{J\} = 1/\binom{n}{q}$ if
$|J| = q$, $P\{J\} = 0$ otherwise. For SVRG: $P\{\emptyset\} = 1 - q/n$, $P\{[1:n]\} = q/n$,
$P\{J\} = 0$ otherwise.
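The two special cases can be sketched as samplers for the index set $J$, together with an empirical check of the uniformity requirement of Definition 1. The function names and the trial count are illustrative assumptions, not part of the original text.

```python
import random

def sample_q_saga(n, q, rng):
    """q-SAGA: a uniformly random subset J of size q."""
    return set(rng.sample(range(n), q))

def sample_svrg(n, q, rng):
    """Randomised SVRG: refresh every slot with probability q/n, else none."""
    return set(range(n)) if rng.random() < q / n else set()

def update_frequency(sampler, n, q, trials=20000, seed=0):
    """Empirical probability that each index j lands in J."""
    rng = random.Random(seed)
    counts = [0] * n
    for _ in range(trials):
        for j in sampler(n, q, rng):
            counts[j] += 1
    return [c / trials for c in counts]

# Both samplers should update every index with probability close to q/n = 0.3.
freq_q_saga = update_frequency(sample_q_saga, n=10, q=3)
freq_svrg = update_frequency(sample_svrg, n=10, q=3)
```

Both schemes share the same marginal update probability and differ only in how the updates are correlated across indices, which is the only property the analysis below relies on.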

$\mathcal{N}$-SAGA Because we need it in Section 3, we will also define an algorithm, which we call
$\mathcal{N}$-SAGA, which makes use of a neighborhood system
$\mathcal{N}_i \subseteq \{1, \dots, n\}$ and which selects neighborhoods uniformly, i.e.
$P\{\mathcal{N}_i\} = \frac{1}{n}$. Note that Definition 1 requires
$|\{i : j \in \mathcal{N}_i\}| = q$ ($\forall j$).

Finally, note that for generalized linear models where $f_i$ depends on $x_i$ only through
$\langle w, x_i \rangle$, we get $f_i'(w) = \xi_i'(w) x_i$, i.e. the update direction is determined by
$x_i$, whereas the effective step length depends on the derivative of a scalar function $\xi_i(w)$.
As used in [9], this leads to significant memory savings as one only needs to store the scalars
$\xi_i'(w)$, since $x_i$ is always given when performing an update.


2.2 Analysis 

Recurrence of Iterates The evolution equation (2) in expectation implies the recurrence (by
crucially using the unbiasedness condition $\mathbf{E} g_i(w) = f'(w)$):

$$\mathbf{E}\|w^+ - w^*\|^2 = \|w - w^*\|^2 - 2\gamma \langle f'(w), w - w^* \rangle + \gamma^2 \mathbf{E}\|g_i(w)\|^2. \tag{4}$$

Here and in the rest of this paper, expectations are always taken only with respect to $i$
(conditioned on the past). We utilize a number of bounds (see [4]), which exploit strong convexity
of $f$ (wherever $\mu$ appears) as well as Lipschitz continuity of the $f_i$-gradients (wherever $L$
appears):

$$\langle f'(w), w - w^* \rangle \ge f(w) - f(w^*) + \frac{\mu}{2}\|w - w^*\|^2, \tag{5}$$

$$\mathbf{E}\|g_i(w)\|^2 \le 2\mathbf{E}\|f_i'(w) - f_i'(w^*)\|^2 + 2\mathbf{E}\|\bar{\alpha}_i - f_i'(w^*)\|^2, \tag{6}$$

$$\|f_i'(w) - f_i'(w^*)\|^2 \le 2L\,h_i(w), \quad h_i(w) := f_i(w) - f_i(w^*) - \langle w - w^*, f_i'(w^*) \rangle, \tag{7}$$

$$\mathbf{E}\|f_i'(w) - f_i'(w^*)\|^2 \le 2L f^\delta(w), \quad f^\delta(w) := f(w) - f(w^*), \tag{8}$$

$$\mathbf{E}\|\bar{\alpha}_i - f_i'(w^*)\|^2 = \mathbf{E}\|\alpha_i - f_i'(w^*)\|^2 - \|\bar{\alpha}\|^2 \le \mathbf{E}\|\alpha_i - f_i'(w^*)\|^2. \tag{9}$$

Eq. (6) can be generalized [4] using $\|x \pm y\|^2 \le (1+\beta)\|x\|^2 + (1+\beta^{-1})\|y\|^2$ with
$\beta > 0$. However, for the sake of simplicity, we sacrifice tightness and choose $\beta = 1$.
Applying all of the above yields:

Lemma 1. For the iterate sequence of any algorithm that evolves solutions according to Eq. (2), the
following holds for a single update step, in expectation over the choice of $i$:

$$\|w - w^*\|^2 - \mathbf{E}\|w^+ - w^*\|^2 \ge \gamma\mu\|w - w^*\|^2 - 2\gamma^2 \mathbf{E}\|\alpha_i - f_i'(w^*)\|^2 + (2\gamma - 4\gamma^2 L) f^\delta(w).$$

All proofs are deferred to the Appendix.


Ideal and Approximate Variance Correction Note that in the ideal case of $\alpha_i = f_i'(w^*)$, we
would immediately get a condition for a contraction by choosing $\gamma = \frac{1}{2L}$, yielding a
rate of $1 - \rho$ with $\rho = \gamma\mu = \frac{\mu}{2L}$, which is half the inverse of the
condition number $\kappa := L/\mu$.

How can we further bound $\mathbf{E}\|\alpha_i - f_i'(w^*)\|^2$ in the case of "non-ideal"
variance-reducing SGD? A key insight is that for memorization algorithms, we can apply the
smoothness bound in Eq. (7):

$$\|\alpha_i - f_i'(w^*)\|^2 = \|f_i'(w^{\tau_i}) - f_i'(w^*)\|^2 \le 2L\,h_i(w^{\tau_i}), \quad (\text{where } w^{\tau_i} \text{ is old } w). \tag{10}$$

Note that if we only had approximations $\beta_i$ in the sense that
$\|\beta_i - \alpha_i\|^2 \le \epsilon_i$ (see Section 3), then we can use
$\|x - y\|^2 \le 2\|x\|^2 + 2\|y\|^2$ to get the somewhat worse bound:

$$\|\beta_i - f_i'(w^*)\|^2 \le 2\|\alpha_i - f_i'(w^*)\|^2 + 2\|\beta_i - \alpha_i\|^2 \le 4L\,h_i(w^{\tau_i}) + 2\epsilon_i. \tag{11}$$


Lyapunov Function Ideally, we would like to show that for a suitable choice of $\gamma$, each
iteration results in a contraction $\mathbf{E}\|w^+ - w^*\|^2 \le (1-\rho)\|w - w^*\|^2$, where
$0 < \rho \le 1$. However, the main challenge arises from the fact that the quantities $\alpha_i$
represent stochastic gradients from previous iterations. This requires a somewhat more complex
proof technique. Adapting the Lyapunov function method from [4], we define upper bounds
$H_i \ge \|\alpha_i - f_i'(w^*)\|^2$ such that $H_i \to 0$ as $w \to w^*$. We start with
$\alpha_i^0 = 0$ and (conceptually) initialize $H_i = \|f_i'(w^*)\|^2$, and then update $H_i$ in sync
with $\alpha_i$,

$$H_i^+ := \begin{cases} 2L\,h_i(w) & \text{if } \alpha_i \text{ is updated} \\ H_i & \text{otherwise} \end{cases} \tag{12}$$

so that we always maintain valid bounds $\|\alpha_i - f_i'(w^*)\|^2 \le H_i$ and
$\mathbf{E}\|\alpha_i - f_i'(w^*)\|^2 \le \bar{H}$ with $\bar{H} := \frac{1}{n}\sum_{i=1}^{n} H_i$.
The $H_i$ are quantities showing up in the analysis, but need not be computed. We now define a
$\sigma$-parameterized family of Lyapunov functions¹

$$\mathcal{L}_\sigma(w, H) := \|w - w^*\|^2 + S\sigma\bar{H}, \quad\text{with}\quad S := \frac{\gamma n}{Lq} \quad\text{and}\quad 0 \le \sigma \le 1. \tag{13}$$

¹ This is a simplified version of the one appearing in [4], as we assume $f'(w^*) = 0$ (unconstrained regime).

In expectation under a random update, the Lyapunov function $\mathcal{L}_\sigma$ changes as
$\mathbf{E}\mathcal{L}_\sigma(w^+, H^+) = \mathbf{E}\|w^+ - w^*\|^2 + S\sigma\,\mathbf{E}\bar{H}^+$.
We can readily apply Lemma 1 to bound the first part. The second part is due to (12), which mirrors
the update of the $\alpha$ variables. By crucially using the property that any $\alpha_j$ has the same
probability of being updated in (3), we get the following result:

Lemma 2. For a uniform $q$-memorization algorithm, it holds that

$$\mathbf{E}\bar{H}^+ = \left(\frac{n-q}{n}\right)\bar{H} + \frac{2Lq}{n} f^\delta(w). \tag{14}$$

Note that in expectation the shrinkage does not depend on the location of previous iterates
$w^\tau$ and the new increment is proportional to the sub-optimality of the current iterate $w$.
Technically, this is how the possibly complicated dependency on previous iterates is dealt with in
an effective manner.
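The identity in Lemma 2 can be checked exactly on a small instance by enumerating all index sets of $q$-SAGA. In the sketch below the values of $H_i$ and $h_i(w)$ are arbitrary placeholders (assumptions for illustration), and the mean of the $h_i$ plays the role of $f^\delta(w)$, which holds because $f'(w^*) = 0$.

```python
from itertools import combinations

def expected_mean_H_after_update(H, h, L, q):
    """Exact E[mean(H+)] under q-SAGA, enumerating all subsets J of size q."""
    n = len(H)
    subsets = list(combinations(range(n), q))
    total = 0.0
    for J in subsets:
        # Eq. (12): refreshed slots become 2*L*h_j(w), the rest keep H_j.
        H_plus = [2 * L * h[j] if j in J else H[j] for j in range(n)]
        total += sum(H_plus) / n
    return total / len(subsets)

def lemma2_rhs(H, h, L, q):
    """((n - q)/n) * mean(H) + (2*L*q/n) * mean(h), the right side of Eq. (14)."""
    n = len(H)
    return (n - q) / n * sum(H) / n + 2 * L * q / n * sum(h) / n

# Arbitrary placeholder values for n = 5 memory slots.
H_values = [1.0, 2.0, 3.0, 4.0, 5.0]
h_values = [0.5, 0.1, 0.2, 0.4, 0.3]
lhs = expected_mean_H_after_update(H_values, h_values, L=2.0, q=2)
rhs = lemma2_rhs(H_values, h_values, L=2.0, q=2)
```

The two numbers coincide because only the marginal update probability $q/n$ of each slot enters the expectation.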


Convergence Analysis We first state our main Lemma about Lyapunov function contractions:

Lemma 3. Fix $c \in (0; 1]$ and $\sigma \in [0; 1]$ arbitrarily. For any uniform $q$-memorization
algorithm with sufficiently small step size $\gamma$ such that

$$\gamma \le \frac{1}{2L}\min\left\{\frac{K\sigma}{K + 2c\sigma},\; 1 - \sigma\right\}, \quad\text{and}\quad K := \frac{4qL}{n\mu}, \tag{15}$$

we have that

$$\mathbf{E}\mathcal{L}_\sigma(w^+, H^+) \le (1-\rho)\,\mathcal{L}_\sigma(w, H), \quad\text{with}\quad \rho := c\mu\gamma. \tag{16}$$

Note that $\gamma < \frac{1}{2L}\max_{\sigma \in [0,1]}\min\{\sigma, 1-\sigma\} = \frac{1}{4L}$ (in the
$c \to 0$ limit).

By maximizing the bounds in Lemma 3 over the choices of $c$ and $\sigma$, we obtain our main result
that provides guaranteed geometric rates for all step sizes up to $\frac{1}{4L}$.

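Theorem 1. Consider a uniform $q$-memorization algorithm. For any step size
$\gamma = \frac{a}{4L}$ with $a < 1$, the algorithm converges at a geometric rate of at least
$(1 - \rho(\gamma))$ with

$$\rho(\gamma) = \frac{q}{n}\cdot\frac{1-a}{1-a/2} = \frac{\mu}{4L}\cdot\frac{K(1-a)}{1-a/2}, \quad\text{if } \gamma \ge \gamma^*(K), \text{ otherwise } \rho(\gamma) = \mu\gamma, \tag{17}$$

where

$$\gamma^*(K) := \frac{a^*(K)}{4L}, \quad a^*(K) := \frac{2K}{1 + K + \sqrt{1 + K^2}}, \quad K := \frac{4qL}{n\mu} = \frac{4q}{n}\kappa. \tag{18}$$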
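The rate function of Theorem 1 is easy to evaluate numerically. The following sketch transcribes Eqs. (17) and (18) into Python; the function names and the example values of $q$, $n$, $L$ and $\mu$ are assumptions chosen for illustration.

```python
import math

def K_value(q, n, L, mu):
    """K = 4*q*L / (n*mu) from Eq. (18)."""
    return 4 * q * L / (n * mu)

def a_star(K):
    """a*(K) = 2K / (1 + K + sqrt(1 + K^2)) from Eq. (18)."""
    return 2 * K / (1 + K + math.sqrt(1 + K * K))

def rate(gamma, q, n, L, mu):
    """Guaranteed rate rho(gamma) of Eq. (17) for gamma = a/(4L) with 0 < a < 1."""
    a = 4 * L * gamma
    if not 0 < a < 1:
        raise ValueError("need 0 < gamma < 1/(4L)")
    gamma_star = a_star(K_value(q, n, L, mu)) / (4 * L)
    if gamma >= gamma_star:
        return (q / n) * (1 - a) / (1 - a / 2)
    return mu * gamma

# Example values (hypothetical): K = 0.4, i.e. closer to the big data regime.
q, n, L, mu = 1, 1000, 1.0, 0.01
gamma_opt = a_star(K_value(q, n, L, mu)) / (4 * L)
rho_opt = rate(gamma_opt, q, n, L, mu)
rho_half = rate(gamma_opt / 2, q, n, L, mu)   # underestimated step size
rho_large = rate(0.2, q, n, L, mu)            # step size above gamma*
```

At $\gamma = \gamma^*(K)$ the two branches of Eq. (17) coincide, so the rate is continuous in the step size.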


We would like to provide more insights into this result. 

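Corollary 1. In Theorem 1, $\rho$ is maximized for $\gamma = \gamma^*(K)$. We can write
$\rho^*(K) = \rho(\gamma^*)$ as

$$\rho^*(K) = \frac{\mu}{4L}\,a^*(K) = \frac{q}{n}\cdot\frac{a^*(K)}{K} = \frac{q}{n}\left[\frac{2}{1 + K + \sqrt{1 + K^2}}\right]. \tag{19}$$

In the big data regime $\rho^* = \frac{q}{n}\left(1 - \frac{1}{2}K + O(K^3)\right)$, whereas in the
ill-conditioned case $\rho^* = \frac{\mu}{4L}\left(1 - \frac{1}{2}K^{-1} + O(K^{-3})\right)$.

The guaranteed rate is bounded by $\frac{\mu}{4L}$ in the regime where the condition number
dominates $n$ (large $K$) and by $\frac{q}{n}$ in the opposite regime of large data (small $K$). Note
that if $K \le 1$, we have $\rho^* = \zeta\frac{q}{n}$ with
$\zeta \in [2/(2+\sqrt{2}); 1] \approx [0.585; 1]$. So for $q \le n\frac{\mu}{4L}$ it pays off to
increase freshness as it affects the rate proportionally. In the ill-conditioned regime
($\kappa > n$), the influence of $q$ vanishes.

Note that for $\gamma \ge \gamma^*(K)$, $\gamma \to \frac{1}{4L}$, the rate decreases monotonically,
yet the decrease is only minor. With the exception of a small neighborhood around $\frac{1}{4L}$,
the entire range of $\gamma \in [\gamma^*; \frac{1}{4L})$ results in very similar rates.
Underestimating $\gamma^*$ however leads to a (significant) slow-down by a factor
$\gamma/\gamma^*$.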

As the optimal choice of $\gamma$ depends on $K$, i.e. $\mu$, we would prefer step sizes that are
$\mu$-independent, thus giving rates that adapt to the local curvature (see [9]). It turns out that
by choosing a step size that maximizes $\min_K \rho(\gamma)/\rho^*(K)$, we obtain a $K$-agnostic
step size with rate off by at most 1/2:

Corollary 2. Choosing $\gamma = \frac{2-\sqrt{2}}{4L}$ leads to $\rho(\gamma) \ge (2-\sqrt{2})\rho^*(K) > \frac{1}{2}\rho^*(K)$ for all $K$.
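The worst-case guarantee of Corollary 2 can be probed numerically. The sketch below works with rates normalized by $\mu/(4L)$, so that only $a$ and $K$ matter, and scans a logarithmic grid of $K$ values; the grid and the function names are assumptions made for this illustration.

```python
import math

def a_star(K):
    """a*(K) from Eq. (18); also equals rho*(K) in units of mu/(4L)."""
    return 2 * K / (1 + K + math.sqrt(1 + K * K))

def normalized_rate(a, K):
    """rho(gamma) * 4L/mu for gamma = a/(4L), following Theorem 1."""
    return K * (1 - a) / (1 - a / 2) if a >= a_star(K) else a

def worst_case_ratio(a, Ks):
    """Smallest value of rho(gamma)/rho*(K) over the grid of K values."""
    return min(normalized_rate(a, K) / a_star(K) for K in Ks)

A_UNIVERSAL = 2 - math.sqrt(2)                      # the choice of Corollary 2
K_GRID = [10 ** (e / 10) for e in range(-60, 61)]   # K from 1e-6 to 1e6

ratio_universal = worst_case_ratio(A_UNIVERSAL, K_GRID)
ratio_small = worst_case_ratio(0.3, K_GRID)   # a smaller step size
ratio_large = worst_case_ratio(0.8, K_GRID)   # a larger step size
```

On this grid the universal choice never falls below the factor $2 - \sqrt{2} \approx 0.586$, whereas the smaller step size loses in the ill-conditioned regime and the larger one in the big data regime.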


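To gain more insights into the trade-offs for these fixed large universal step sizes, the following
corollary details the range of rates obtained:

Corollary 3. Choosing $\gamma = \frac{a}{4L}$ with $a < 1$ yields
$\rho = \min\left\{\frac{1-a}{1-a/2}\cdot\frac{q}{n},\; \frac{a}{4}\cdot\frac{\mu}{L}\right\}$. In
particular, we have for the choice $\gamma = \frac{1}{5L}$ that
$\rho = \min\left\{\frac{1}{3}\frac{q}{n},\; \frac{1}{5}\frac{\mu}{L}\right\}$ (roughly matching the
rate given in [4] for $q = 1$).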
3 Sharing Gradient Memory

3.1 $\epsilon$-Approximation Analysis


As we have seen, fresher gradient memory, i.e. a larger choice for $q$, affects the guaranteed
convergence rate as $\rho \sim q/n$. However, as long as one step of a $q$-memorization algorithm is
as expensive as $q$ steps of a 1-memorization algorithm, this insight does not lead to practical
improvements per se. Yet, it raises the question of whether we can accelerate these methods, in
particular $\mathcal{N}$-SAGA, by approximating gradients stored in the $\alpha_i$ variables. Note
that we are always using the correct stochastic gradients in the current update and by assuring
$\sum_i \bar{\alpha}_i = 0$, we will not introduce any bias in the update direction. Rather, we lose
the guarantee of asymptotically vanishing variance at $w^*$. However, as we will show, it is
possible to retain geometric rates up to a $\delta$-ball around $w^*$.

We will focus on SAGA-style updates for concreteness and investigate an algorithm that mirrors
$\mathcal{N}$-SAGA with the only difference that it maintains approximations $\beta_i$ to the true
$\alpha_i$ variables. We aim to guarantee $\mathbf{E}\|\alpha_i - \beta_i\|^2 \le \epsilon$ and will
use Eq. (11) to modify the right-hand-side of Lemma 1. We see that approximation errors are
multiplied with $\gamma^2$, which implies that we should aim for small learning rates, ideally
without compromising the $\mathcal{N}$-SAGA rate. From Theorem 1 and Corollary 1 we can see that we
can choose $\gamma \lesssim q/(\mu n)$ for $n$ sufficiently large, which indicates that there is hope
to dampen the effects of the approximations. We now make this argument more precise.

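Theorem 2. Consider a uniform $q$-memorization algorithm with $\alpha$-updates that are on average
$\epsilon$-accurate (i.e. $\bar{\mathbf{E}}\|\alpha_i - \beta_i\|^2 \le \epsilon$). For any step size
$\gamma \le \tilde{\gamma}(K)$, where $\tilde{\gamma}$ is given by Corollary 5 in the appendix (note
that $\tilde{\gamma}(K) \ge \frac{2}{3}\gamma^*(K)$ and $\tilde{\gamma}(K) \to \gamma^*(K)$ as
$K \to 0$), we get

$$\bar{\mathbf{E}}\mathcal{L}(w^t, H^t) \le (1 - \mu\gamma)^t \mathcal{L}_0 + \frac{4\gamma\epsilon}{\mu}, \quad\text{with}\quad \mathcal{L}_0 := \|w^0 - w^*\|^2 + s(\gamma)\,\mathbf{E}\|f_i'(w^*)\|^2, \tag{20}$$

where $\bar{\mathbf{E}}$ denotes the (unconditional) expectation over histories (in contrast to
$\mathbf{E}$ which is conditional), and $s(\gamma) := \frac{4\gamma}{K\mu}(1 - 2L\gamma)$.

Corollary 4. With $\gamma = \min\{\mu, \tilde{\gamma}(K)\}$ we have

$$\frac{4\gamma\epsilon}{\mu} \le 4\epsilon, \quad\text{with a rate}\quad \rho = \min\{\mu^2, \mu\tilde{\gamma}\}. \tag{21}$$

In the relevant case of $\mu \sim 1/\sqrt{n}$, we thus converge towards some $\sqrt{\epsilon}$-ball
around $w^*$ at a similar rate as for the exact method. For $\mu \sim n^{-1}$, we have to reduce the
step size significantly to compensate the extra variance and to still converge to a
$\sqrt{\epsilon}$-ball, resulting in the slower rate $\rho \sim n^{-2}$, instead of
$\rho \sim n^{-1}$.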
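The interplay between the geometric term and the error floor can be illustrated by iterating the one-step recurrence behind Theorem 2, namely $\mathcal{L}^+ \le (1 - \gamma\mu)\mathcal{L} + 4\gamma^2\epsilon$, treated here as an equality. The numerical values of $\gamma$, $\mu$, $\epsilon$ and $\mathcal{L}_0$ are arbitrary assumptions.

```python
def unroll(L0, gamma, mu, eps, steps):
    """Iterate L_{t+1} = (1 - gamma*mu) * L_t + 4 * gamma**2 * eps."""
    value = L0
    for _ in range(steps):
        value = (1 - gamma * mu) * value + 4 * gamma ** 2 * eps
    return value

def theorem2_bound(L0, gamma, mu, eps, steps):
    """(1 - mu*gamma)**t * L0 + 4*gamma*eps/mu, the right-hand side of Eq. (20)."""
    return (1 - mu * gamma) ** steps * L0 + 4 * gamma * eps / mu

# Hypothetical constants; the floor 4*gamma*eps/mu equals 0.02 here.
L0, gamma, mu, eps = 10.0, 0.05, 0.1, 0.01
floor = 4 * gamma * eps / mu
trajectory = [unroll(L0, gamma, mu, eps, t) for t in (0, 1, 10, 100, 5000)]
bounds = [theorem2_bound(L0, gamma, mu, eps, t) for t in (0, 1, 10, 100, 5000)]
```

The iterated values stay below the bound at every horizon and level off at the floor $4\gamma\epsilon/\mu$, which shrinks linearly with the step size.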


We also note that the geometric convergence of SGD with a constant step size to a neighborhood
of the solution (also proven in [8]) can arise as a special case in our analysis. By setting
$\alpha_i = 0$ in Lemma 1, we can take $\epsilon = \mathbf{E}\|f_i'(w^*)\|^2$ for SGD. An approximate
$q$-memorization algorithm can thus be interpreted as making $\epsilon$ an algorithmic parameter,
rather than a fixed value as in SGD.


3.2 Algorithms 


Sharing Gradient Memory We now discuss our proposal of using neighborhoods for sharing
gradient information between close-by data points. Thereby we avoid an increase in gradient
computations relative to $q$- or $\mathcal{N}$-SAGA at the expense of suffering an approximation
bias. This leads to a new tradeoff between freshness and approximation quality, which can be
resolved in non-trivial ways, depending on the desired final optimization accuracy.


We distinguish two types of quantities. First, the gradient memory $\alpha_i$ as defined by the
reference algorithm $\mathcal{N}$-SAGA. Second, the shared gradient memory state $\beta_i$, which is
used in a modified update rule in Eq. (2), i.e.
$w^+ = w - \gamma(f_i'(w) - \beta_i + \bar{\beta})$. Assume that we select an index $i$ for the
weight update, then we generalize Eq. (3) as follows

$$\beta_j^+ := \begin{cases} f_i'(w) & \text{if } j \in \mathcal{N}_i \\ \beta_j & \text{otherwise} \end{cases}, \quad \bar{\beta} := \frac{1}{n}\sum_{i=1}^{n} \beta_i, \quad \bar{\beta}_i := \beta_i - \bar{\beta}. \tag{22}$$

In the important case of generalized linear models, where one has $f_i'(w) = \xi_i'(w) x_i$, we can
modify the relevant case in Eq. (22) by $\beta_j^+ := \xi_i'(w) x_j$. This has the advantages of
using the correct direction, while reducing storage requirements.
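A minimal sketch of the shared memory update for generalized linear models is given below. It overwrites $\beta_j$ by $\xi_i'(w) x_j$ for all neighbors $j \in \mathcal{N}_i$ and keeps $\bar{\beta}$ in sync incrementally. The function name, the dense list-of-lists storage and the toy inputs are assumptions for illustration; a memory-efficient variant would store only the scalars.

```python
def shared_memory_update(beta, beta_bar, neighborhood, xi_prime_i, X):
    """Set beta_j := xi_i'(w) * x_j for every j in N_i and keep beta_bar = mean(beta)."""
    n = len(beta)
    for j in neighborhood:
        new_beta_j = [xi_prime_i * x for x in X[j]]
        for k, (new, old) in enumerate(zip(new_beta_j, beta[j])):
            beta_bar[k] += (new - old) / n
        beta[j] = new_beta_j

# Toy example (hypothetical): three points in two dimensions, N_0 = {0, 1},
# and a scalar derivative xi_0'(w) = 2.0 obtained from the sampled index i = 0.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
beta = [[0.0, 0.0] for _ in X]
beta_bar = [0.0, 0.0]
shared_memory_update(beta, beta_bar, neighborhood=[0, 1], xi_prime_i=2.0, X=X)
```

Only one scalar derivative is computed, yet two memory slots are refreshed, each along its own data direction $x_j$.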
Approximation Bounds For our analysis, we need to control the error
$\|\alpha_i - \beta_i\|^2 \le \epsilon_i$. This obviously requires problem-specific investigations.

Let us first look at the case of ridge regression,
$f_i(w) := \frac{1}{2}(\langle x_i, w \rangle - y_i)^2 + \frac{\lambda}{2}\|w\|^2$, and thus
$f_i'(w) = \xi_i'(w) x_i + \lambda w$ with $\xi_i'(w) := \langle x_i, w \rangle - y_i$. Considering
$j \in \mathcal{N}_i$ being updated, we have

$$\|\alpha_j^+ - \beta_j^+\| = |\xi_j'(w) - \xi_i'(w)|\,\|x_j\| \le \left(\delta_{ij}\|w\| + |y_j - y_i|\right)\|x_j\| =: \epsilon_{ij}(w), \tag{23}$$

where $\delta_{ij} := \|x_i - x_j\|$. Note that this can be pre-computed with the exception of the
norm $\|w\|$ that we only know at the time of an update.
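The ridge regression bound of Eq. (23) can be checked on random inputs. The sketch below compares the exact sharing error $|\xi_j'(w) - \xi_i'(w)|\,\|x_j\|$ with $\epsilon_{ij}(w)$; the helper names, the dimension and the Gaussian test data are assumptions made for this illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def ridge_sharing_error(x_i, y_i, x_j, y_j, w):
    """Exact |xi_j'(w) - xi_i'(w)| * ||x_j|| for xi'(w) = <x, w> - y."""
    return abs((dot(x_j, w) - y_j) - (dot(x_i, w) - y_i)) * norm(x_j)

def ridge_sharing_bound(x_i, y_i, x_j, y_j, w):
    """Right-hand side of Eq. (23): (delta_ij * ||w|| + |y_j - y_i|) * ||x_j||."""
    delta_ij = norm([a - b for a, b in zip(x_i, x_j)])
    return (delta_ij * norm(w) + abs(y_j - y_i)) * norm(x_j)

def max_violation(trials=1000, d=3, seed=0):
    """Largest value of error - bound over random draws; should not be positive."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        x_i = [rng.gauss(0, 1) for _ in range(d)]
        x_j = [rng.gauss(0, 1) for _ in range(d)]
        w = [rng.gauss(0, 1) for _ in range(d)]
        y_i, y_j = rng.gauss(0, 1), rng.gauss(0, 1)
        gap = (ridge_sharing_error(x_i, y_i, x_j, y_j, w)
               - ridge_sharing_bound(x_i, y_i, x_j, y_j, w))
        worst = max(worst, gap)
    return worst

worst_gap = max_violation()
```

The bound follows from the Cauchy-Schwarz and triangle inequalities, so `worst_gap` is non-positive up to rounding.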


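Similarly, for regularized logistic regression with $y \in \{-1, 1\}$, we have
$\xi_i'(w) = -y_i / (1 + e^{y_i \langle x_i, w \rangle})$. With the requirement on neighbors that
$y_i = y_j$ we get

$$\|\alpha_j^+ - \beta_j^+\| \le \frac{e^{\delta_{ij}\|w\|} - 1}{1 + e^{-\langle x_i, w \rangle}}\,\|x_j\| =: \epsilon_{ij}(w). \tag{24}$$

Again, we can pre-compute $\delta_{ij}$ and $\|x_j\|$. In addition to $\xi_i'(w)$ we can also store
$\langle x_i, w \rangle$.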
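The logistic regression bound can be probed numerically in the same spirit. The sketch below assumes the standard logistic loss $\log(1 + e^{-y\langle x, w \rangle})$, so that $\xi'(w) = -y/(1 + e^{y\langle x, w \rangle})$, and the bound $(e^{\delta_{ij}\|w\|} - 1)/(1 + e^{-\langle x_i, w \rangle})\,\|x_j\|$ for neighbors with equal labels. Function names and test data are illustrative assumptions.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def logistic_xi_prime(x, y, w):
    """Scalar derivative of log(1 + exp(-y*<x, w>)) with respect to <x, w>."""
    return -y / (1.0 + math.exp(y * dot(x, w)))

def logistic_sharing_error(x_i, x_j, y, w):
    """Exact |xi_j'(w) - xi_i'(w)| * ||x_j|| when both points carry label y."""
    return abs(logistic_xi_prime(x_j, y, w) - logistic_xi_prime(x_i, y, w)) * norm(x_j)

def logistic_sharing_bound(x_i, x_j, w):
    """(exp(delta_ij * ||w||) - 1) / (1 + exp(-<x_i, w>)) * ||x_j||."""
    delta_ij = norm([a - b for a, b in zip(x_i, x_j)])
    return (math.exp(delta_ij * norm(w)) - 1.0) / (1.0 + math.exp(-dot(x_i, w))) * norm(x_j)

def max_violation(trials=1000, d=3, seed=0):
    """Largest value of error - bound over random draws with equal labels."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        x_i = [rng.gauss(0, 1) for _ in range(d)]
        x_j = [rng.gauss(0, 1) for _ in range(d)]
        w = [rng.gauss(0, 1) for _ in range(d)]
        y = rng.choice([-1, 1])
        gap = logistic_sharing_error(x_i, x_j, y, w) - logistic_sharing_bound(x_i, x_j, w)
        worst = max(worst, gap)
    return worst

worst_gap = max_violation()
```

Unlike the ridge case, the bound grows exponentially in $\delta_{ij}\|w\|$, so it is informative mainly for tight neighborhoods.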


$\epsilon\mathcal{N}$-SAGA We can use these bounds in two ways. First, assuming that the iterates
stay within a norm-ball (e.g. $L_2$-ball), we can derive upper bounds

$$\epsilon_j(r) \ge \max\{\epsilon_{ij}(w) : j \in \mathcal{N}_i, \|w\| \le r\}, \quad \epsilon(r) = \frac{1}{n}\sum_j \epsilon_j(r). \tag{25}$$

Obviously, the more compact the neighborhoods are, the smaller $\epsilon(r)$. This is most useful
for the analysis. Second, we can specify a target accuracy $\epsilon$ and then prune neighborhoods
dynamically. This approach is more practically relevant as it allows us to directly control
$\epsilon$. However, a dynamically varying neighborhood violates Definition 1. We fix this in a
sound manner by modifying the memory updates as follows:

$$\beta_j^+ := \begin{cases} f_i'(w) & \text{if } j \in \mathcal{N}_i \text{ and } \epsilon_{ij}(w) \le \epsilon \\ f_j'(w) & \text{if } j \in \mathcal{N}_i \text{ and } \epsilon_{ij}(w) > \epsilon \\ \beta_j & \text{otherwise} \end{cases} \tag{26}$$

This allows us to interpolate between sharing more aggressively (saving computation) and
performing more computations in an exact manner. In the limit of $\epsilon \to 0$, we recover
$\mathcal{N}$-SAGA; as $\epsilon \to \infty$, we recover the first variant mentioned.
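The pruned memory update of Eq. (26) can be sketched as a small routine that takes the gradient oracle and the error bound as callables. Both callables, the function name and the scalar toy gradients are hypothetical stand-ins introduced for this example.

```python
def eps_n_saga_memory_update(beta, i, neighborhood, grad, bound, eps):
    """Eq. (26): share f_i'(w) with neighbours whose bound is within eps,
    recompute f_j'(w) for the others. Returns the number of recomputed gradients."""
    shared = grad(i)
    recomputed = 0
    for j in neighborhood:
        if bound(i, j) <= eps:
            beta[j] = shared
        else:
            beta[j] = grad(j)
            recomputed += 1
    return recomputed

# Toy stand-ins: scalar "gradients" and their distance as the error bound.
toy_grads = [1.0, 1.1, 3.0]
grad = lambda j: toy_grads[j]
bound = lambda i, j: abs(toy_grads[i] - toy_grads[j])

beta_shared = [0.0, 0.0, 0.0]
extra_shared = eps_n_saga_memory_update(beta_shared, 0, [0, 1, 2], grad, bound, eps=0.5)

beta_exact = [0.0, 0.0, 0.0]
extra_exact = eps_n_saga_memory_update(beta_exact, 0, [0, 1, 2], grad, bound, eps=0.0)
```

With a moderate $\epsilon$ only the distant neighbor triggers an additional gradient computation, while $\epsilon = 0$ recomputes every neighbor other than $i$ itself, in line with the two limits described above.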


Computing Neighborhoods Note that the pairwise Euclidean distances show up in the bounds in
Eq. (23) and (24). In the classification case we also require $y_i = y_j$, whereas in the ridge
regression case, we also want $|y_i - y_j|$ to be small. Thus modulo filtering, this suggests the
use of Euclidean
distances as the metric for defining neighborhoods. Standard approximation techniques for finding 
near(est) neighbors can be used. This comes with a computational overhead, yet the additional costs 
will amortize over multiple runs or multiple data analysis tasks. 
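For small data sets the neighborhoods can be computed by brute force. The following sketch is an exact $O(n^2)$ search with label filtering, not one of the approximate techniques alluded to above; the function name and the toy data are assumptions.

```python
import math

def nearest_neighbors(X, y, q):
    """For each i, the q closest points with the same label (i itself included)."""
    neighborhoods = []
    for i, x_i in enumerate(X):
        same_label = [j for j in range(len(X)) if y[j] == y[i]]
        same_label.sort(key=lambda j: math.dist(x_i, X[j]))
        neighborhoods.append(same_label[:q])
    return neighborhoods

# Toy data (hypothetical): two well-separated classes in the plane.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.3, 0.0]]
y = [1, 1, -1, -1, 1]
neighborhoods = nearest_neighbors(X, y, q=2)
# Number of neighborhoods each index belongs to.
coverage = [sum(j in N for N in neighborhoods) for j in range(len(X))]
```

In this example index 1 is covered by three neighborhoods and index 4 by only one, so plain nearest-neighbor sets need not satisfy the requirement $|\{i : j \in \mathcal{N}_i\}| = q$ of Definition 1 without further balancing.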


4 Experimental Results 

Algorithms We present experimental results on the performance of the different variants of
memorization algorithms for variance reduced SGD as discussed in this paper. SAGA has been
uniformly superior to SVRG in our experiments, so we compare SAGA and
$\epsilon\mathcal{N}$-SAGA (from Eq. (26)), alongside with SGD as a straw man and $q$-SAGA as a
point of reference for speed-ups. We have chosen $q = 20$ for $q$-SAGA and
$\epsilon\mathcal{N}$-SAGA. The same setting was used across all data sets and experiments.

Data Sets As special cases for the choice of the loss function and regularizer in Eq. (1), we
consider two commonly occurring problems in machine learning, namely least-square regression and
$\ell_2$-regularized logistic regression. We apply least-square regression on the million song year
regression from the UCI repository. This dataset contains $n = 515{,}345$ data points, each
described by $d = 90$ input features. We apply logistic regression on the cov and ijcnn1 datasets
obtained from the libsvm website². The cov dataset contains $n = 581{,}012$ data points, each
described by $d = 54$ input features. The ijcnn1 dataset contains $n = 49{,}990$ data points, each
described by $d = 22$ input features. We added an $\ell_2$-regularizer
$\Omega(w) = \mu\|w\|_2^2$ to ensure the objective is strongly convex.

² http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets
[Figure 1 plots: panels (a) Cov, (b) Ijcnn1, (c) Year; rows for $\mu = 10^{-1}$ and $\mu = 10^{-3}$, once per gradient evaluation and once per datapoint evaluation.]

Figure 1: Comparison of $\epsilon\mathcal{N}$-SAGA, $q$-SAGA, SAGA and SGD (with decreasing and
constant step size) on three datasets. The top two rows show the suboptimality as a function of the
number of gradient evaluations for two different values of $\mu = 10^{-1}, 10^{-3}$. The bottom two
rows show the suboptimality as a function of the number of datapoint evaluations (i.e. number of
stochastic updates) for two different values of $\mu = 10^{-1}, 10^{-3}$.
Experimental Protocol We have run the algorithms in question in an i.i.d. sampling setting and
averaged the results over 5 runs. Figure 1 shows the evolution of the suboptimality $f^\delta$ of the
objective as a function of two different metrics: (1) in terms of the number of update steps
performed ("datapoint evaluation"), and (2) in terms of the number of gradient computations
("gradient evaluation"). Note that SGD and SAGA compute one stochastic gradient per update step,
unlike $q$-SAGA, which is included here not as a practically relevant algorithm, but as an
indication of potential improvements that could be achieved by fresher corrections. A step size
$\gamma = \frac{q}{\mu n}$ was used everywhere, except for "plain SGD". Note that as $K \ll 1$ in all
cases, this is close to the optimal value suggested by our analysis; moreover, using a step size of
$\sim \frac{1}{L}$ for SAGA as suggested in previous work [9] did not appear to give better results.
For plain SGD, we used a schedule of the form $\gamma_t = \gamma_0/t$ with constants optimized
coarsely via cross-validation. The x-axis is expressed in units of $n$ (suggestively called
"epochs").

SAGA vs. SGD cst As we can see, if we run SGD with the same constant step size as SAGA, 
it takes several epochs until SAGA really shows a significant gain. The constant step-size variant 
of SGD is faster in the early stages until it converges to a neighborhood of the optimum, where 
individual runs start showing a very noisy behavior. 

SAGA vs. $q$-SAGA $q$-SAGA outperforms plain SAGA quite consistently when counting stochastic
update steps. This establishes optimistic reference curves of what we can expect to achieve with
$\epsilon\mathcal{N}$-SAGA. The actual speed-up is somewhat data set dependent.

$\epsilon\mathcal{N}$-SAGA vs. SAGA and $q$-SAGA $\epsilon\mathcal{N}$-SAGA with sufficiently small
$\epsilon$ can realize much of the possible freshness gains of $q$-SAGA and performs very similarly
for a few (2-10) epochs, where it traces nicely between the SAGA and $q$-SAGA curves. We see solid
speed-ups on all three datasets for both $\mu = 0.1$ and $\mu = 0.001$.

Asymptotics It should be clearly stated that running $\epsilon\mathcal{N}$-SAGA at a fixed
$\epsilon$ for longer will not result in good asymptotics on the empirical risk. This is because, as
theory predicts, $\epsilon\mathcal{N}$-SAGA cannot drive the suboptimality to zero, but rather
levels off at a point determined by $\epsilon$. In our experiments, the cross-over point with SAGA
was typically after 5 to 15 epochs. Note that the gains in the first epochs can be significant,
though. In practice, one will either define a desired accuracy level and choose $\epsilon$
accordingly or one will switch to SAGA for accurate convergence.

5 Conclusion 

We have generalized variance reduced SGD methods under the name of memorization algorithms
and presented a corresponding analysis, which commonly applies to all such methods. We have
investigated in detail the range of safe step sizes with their corresponding geometric rates as
guaranteed by our theory. This has delivered a number of new insights, for instance about the
trade-offs between small ($\sim \frac{1}{n}$) and large ($\sim \frac{1}{4L}$) step sizes in different
regimes as well as about the role of the freshness of stochastic gradients evaluated at past
iterates.

We have also investigated and quantified the effect of additional errors in the variance correction
terms on the convergence behavior. Dependent on how $\mu$ scales with $n$, we have shown that such
errors can be tolerated, yet, for small $\mu$, may have a negative effect on the convergence rate as
much smaller step sizes are needed to still guarantee convergence to a small region. We believe
this result to be relevant for a number of approximation techniques in the context of variance
reduced SGD.

Motivated by these insights and results of our analysis, we have proposed
$\epsilon\mathcal{N}$-SAGA, a modification of SAGA that exploits similarities between training data
points by defining a neighborhood system. Approximate versions of per-data point gradients are
then computed by sharing information among neighbors. This opens up the possibility of
variance-reduction in a streaming data setting, where each data point is only seen once. We believe
this to be a promising direction for future work.

Empirically, we have been able to achieve consistent speed-ups for the initial phase of regularized
risk minimization. This shows that approximate computations of variance correction terms
constitute a promising approach for trading off computation against solution accuracy.

Acknowledgments We would like to thank Yannic Kilcher, Martin Jaggi, Rémi Leblond and the
anonymous reviewers for helpful suggestions and corrections.

A Appendix

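Lemma 1. For the iterate sequence of any algorithm that evolves solutions according to Eq. (2), the
following holds for a single update step, in expectation over the choice of $i$, with
$\Delta := \|w - w^*\|^2 - \mathbf{E}\|w^+ - w^*\|^2$:

$$\Delta \ge \gamma\mu\|w - w^*\|^2 - 2\gamma^2 \mathbf{E}\|\alpha_i - f_i'(w^*)\|^2 + (2\gamma - 4\gamma^2 L) f^\delta(w).$$

Proof. Starting from Eq. (4), and applying Eq. (5), then Eqs. (6) and (9), and finally Eq. (8), we have

$$\begin{aligned}
\Delta &= 2\gamma\langle f'(w), w - w^* \rangle - \gamma^2 \mathbf{E}\|g_i(w)\|^2 \\
&\ge \gamma\mu\|w - w^*\|^2 + 2\gamma f^\delta(w) - \gamma^2 \mathbf{E}\|g_i(w)\|^2 \\
&\ge \gamma\mu\|w - w^*\|^2 + 2\gamma f^\delta(w) - 2\gamma^2 \mathbf{E}\|f_i'(w) - f_i'(w^*)\|^2 - 2\gamma^2 \mathbf{E}\|\alpha_i - f_i'(w^*)\|^2 \\
&\ge \gamma\mu\|w - w^*\|^2 + 2\gamma(1 - 2\gamma L) f^\delta(w) - 2\gamma^2 \mathbf{E}\|\alpha_i - f_i'(w^*)\|^2.
\end{aligned}$$

□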


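Lemma 2. For a uniform $q$-memorization algorithm, it holds that

$$\mathbf{E}\bar{H}^+ = \left(\frac{n-q}{n}\right)\bar{H} + \frac{2Lq}{n} f^\delta(w).$$

Proof. From the uniformity property $(*)$ in Definition 1, it follows that

$$n\,\mathbf{E}\bar{H}^+ = \sum_{i=1}^{n} \mathbf{E}H_i^+ = \sum_{i=1}^{n}\left(\left(1 - \frac{q}{n}\right)H_i + \frac{2Lq}{n}\,h_i(w)\right) = (n - q)\bar{H} + \frac{2Lq}{n}\sum_{i=1}^{n} h_i(w).$$

Exploiting the fact that $\frac{1}{n}\sum_{i=1}^{n} h_i(w) = f(w) - f(w^*) + 0 = f^\delta(w)$
completes the proof. □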


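Lemma 3. Fix $c \in (0; 1]$ and $\sigma \in [0; 1]$ arbitrarily. For any uniform $q$-memorization
algorithm with sufficiently small step size $\gamma$ such that

$$\gamma \le \frac{1}{L}\min\left\{\frac{K\sigma}{2K + 4c\sigma},\; \frac{1-\sigma}{2}\right\}, \quad\text{and}\quad K := \frac{4qL}{n\mu}, \tag{27}$$

we have that

$$\mathbf{E}\mathcal{L}_\sigma(w^+, H^+) \le (1-\rho)\,\mathcal{L}_\sigma(w, H), \quad\text{with}\quad \rho := c\mu\gamma.$$

Note that $\gamma < \frac{1}{2L}\max_{\sigma \in [0,1]}\min\{\sigma, 1-\sigma\} = \frac{1}{4L}$ (in the
$c \to 0$ limit).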


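Proof. From Lemma 1 we can see that we will have $\rho \le \gamma\mu$ based on the
$\|w - w^*\|^2$ part of $\mathcal{L}_\sigma$. Hence, we can write the rate as $\rho = c\mu\gamma$,
where $0 < c \le 1$.

Let us now apply both Lemma 1 and Lemma 2 to quantify the progress guaranteed to be made in
one iteration of the algorithm in expectation, combining the changes to the iterate $w \to w^+$ as
well as those to the memory $\alpha \to \alpha^+$ into $\mathcal{L}_\sigma$. Set
$\Delta_\sigma := \mathcal{L}_\sigma(w, H) - \mathbf{E}\mathcal{L}_\sigma(w^+, H^+)$, then

$$\begin{aligned}
\Delta_\sigma &= \|w - w^*\|^2 - \mathbf{E}\|w^+ - w^*\|^2 + S\sigma\left(\bar{H} - \mathbf{E}\bar{H}^+\right) \\
&\ge \gamma\mu\|w - w^*\|^2 - 2\gamma^2 \mathbf{E}\|\alpha_i - f_i'(w^*)\|^2 + 2\gamma(1 - 2\gamma L) f^\delta(w) \\
&\quad + S\sigma\left(\bar{H} - \left(\frac{n-q}{n}\right)\bar{H} - \frac{2Lq}{n} f^\delta(w)\right).
\end{aligned} \tag{28}$$

As we argued after Eq. (12), the definition of $H_i$ combined with property (10) ensure the crucial
bound $\mathbf{E}\|\alpha_i - f_i'(w^*)\|^2 \le \bar{H}$. Including it and gathering terms in the same
"units", we get:

$$\Delta_\sigma \ge \gamma\mu\|w - w^*\|^2 + \left[S\sigma\frac{q}{n} - 2\gamma^2\right]\bar{H} + 2\left[-S\sigma\frac{Lq}{n} + \gamma(1 - 2\gamma L)\right] f^\delta(w). \tag{29}$$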
We can further simplify the term in the second rectangular brackets with the definition of $S$ (in
hindsight motivating its definition):

$$-2S\sigma\frac{Lq}{n} + 2\gamma(1 - 2\gamma L) = 2\gamma\left[-\sigma + (1 - 2\gamma L)\right]. \tag{30}$$

We require this term to be non-negative, so that we can safely drop it. This leads to an upper bound
requirement on the step size:

$$-\sigma + 1 - 2\gamma L \ge 0 \iff \gamma \le \frac{1-\sigma}{2L}. \tag{31}$$

The term in the first rectangular brackets in Eq. (29) needs to be $\ge \rho S\sigma$ in order to
recover $\rho\mathcal{L}_\sigma = \rho\left(\|w - w^*\|^2 + S\sigma\bar{H}\right)$. Inserting the
definition of $S$, $\rho$ and dividing by $\gamma$ yields

$$\frac{\sigma}{L} - 2\gamma \ge \frac{\rho S\sigma}{\gamma} = c\gamma\mu\frac{n\sigma}{Lq} = \frac{4c\sigma\gamma}{K} \iff \gamma \le \frac{1}{L}\cdot\frac{K\sigma}{2K + 4c\sigma}. \tag{32}$$

We can summarize the derivation in the claimed combined inequality.

□


Theorem 1. Consider a uniform $q$-memorization algorithm. For any step size
$\gamma = \frac{a}{4L}$ with $a < 1$, the algorithm converges at a geometric rate of at least
$(1 - \rho(\gamma))$ with

$$\rho(\gamma) = \frac{q}{n}\cdot\frac{1-a}{1-a/2} = \frac{\mu}{4L}\cdot\frac{K(1-a)}{1-a/2}, \quad\text{if } \gamma \ge \gamma^*(K), \text{ otherwise } \rho(\gamma) = \mu\gamma,$$

where

$$\gamma^*(K) := \frac{a^*(K)}{4L}, \quad a^*(K) := \frac{2K}{1 + K + \sqrt{1 + K^2}}, \quad K := \frac{4qL}{n\mu}.$$

Proof. Consider a fixed $\gamma < \frac{1}{4L}$. There are potentially (infinitely) many choices of
$(c, \sigma)$ that fulfill the condition in Eq. (27). Among those, the largest rate is obtained by
maximizing $c \le 1$ as $\rho(\gamma) = c\mu\gamma$. Note that for any $\gamma$ that does not achieve
Eq. (27) with equality for both terms, one can find a larger $\gamma$ with the same choice of $c$ by
either increasing (slack in the first inequality) or decreasing (slack in the second inequality)
$\sigma$. We thus focus on step sizes that are maximal for some choice of $(c, \sigma)$. Equality
with the second bound directly gives us

$$\gamma = \frac{1 - \sigma^*}{2L} \iff \sigma^* = 1 - 2L\gamma. \tag{33}$$

We plug this into the first bound and again require equality with $\gamma$, which yields an
optimality condition for $c$:

$$L\gamma = \frac{K\sigma^*}{2K + 4c\sigma^*} \iff c^* = \frac{K}{4\gamma L}\left[1 - \frac{2\gamma L}{\sigma^*}\right] = \frac{K}{4\gamma L}\cdot\frac{1 - 4L\gamma}{1 - 2L\gamma}, \tag{34}$$

and thus

$$\rho = c^*\mu\gamma = \frac{\mu K}{4L}\cdot\frac{1 - 4L\gamma}{1 - 2L\gamma} = \frac{q}{n}\cdot\frac{1 - 4L\gamma}{1 - 2L\gamma}. \tag{35}$$

It remains to check what the admissible range of $\gamma$ is that achieves the bound in Eq. (27) as
we required. The latter is determined by the constraints $c \in (0; 1]$. From Eq. (34) we can read
off for $c^* > 0$,

$$1 - 4L\gamma > 0 \iff \gamma < \frac{1}{4L}. \tag{36}$$

At the other extreme of $c = 1$ we can solve the resulting quadratic equation in $\gamma$

$$\gamma = \frac{q}{n\mu}\cdot\frac{1 - 4L\gamma}{1 - 2L\gamma} = \frac{K}{4L}\cdot\frac{1 - 4L\gamma}{1 - 2L\gamma} \tag{37}$$

to get $\gamma = \gamma^*(K)$ as claimed in Eq. (18) (excluding the second root which yields
$\gamma > \frac{1}{4L}$). Moreover, for $\gamma < \gamma^*(K)$ we choose $c = 1$ to maximize the rate
and have $\rho = \mu\gamma$. □
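Corollary 1. In Theorem 1, $\rho$ is maximized for $\gamma = \gamma^*(K)$. We can write
$\rho^*(K) = \rho(\gamma^*)$ as

$$\rho^*(K) = \frac{\mu}{4L}\,a^*(K) = \frac{q}{n}\cdot\frac{a^*(K)}{K} = \frac{q}{n}\left[\frac{2}{1 + K + \sqrt{1 + K^2}}\right].$$

In the big data regime $\rho^* = \frac{q}{n}\left(1 - \frac{1}{2}K + O(K^3)\right)$, whereas in the
ill-conditioned case $\rho^* = \frac{\mu}{4L}\left(1 - \frac{1}{2}K^{-1} + O(K^{-3})\right)$.

Proof. Plugging in the definitions of $\gamma^*(K)$ and $K$ and performing some symbolic
simplifications yields the result. □

Corollary 2. Choosing $\gamma = \frac{2-\sqrt{2}}{4L}$ leads to $\rho(\gamma) \ge (2-\sqrt{2})\rho^* > \frac{1}{2}\rho^*$.

Proof. Write $\gamma = \frac{a}{4L}$, then if $\gamma \ge \gamma^*(K)$:
$\frac{\rho}{\rho^*} \ge \frac{1-a}{1-a/2}$, otherwise: $\frac{\rho}{\rho^*} \ge a$, with equality
when $K = \infty$. Setting both equal yields $a = 2 - \sqrt{2} \approx 0.5858$. □

Corollary 3. Choosing $\gamma = \frac{a}{4L}$ with $a < 1$ yields
$\rho = \min\left\{\frac{1-a}{1-a/2}\cdot\frac{q}{n},\; \frac{a}{4}\cdot\frac{\mu}{L}\right\}$. In
particular, we have for the choice $\gamma = \frac{1}{5L}$ that
$\rho = \min\left\{\frac{1}{3}\frac{q}{n},\; \frac{1}{5}\frac{\mu}{L}\right\}$.

Proof. If $\gamma \ge \gamma^*(K)$ then $\rho = \frac{1-a}{1-a/2}\cdot\frac{q}{n}$, otherwise
$\rho = \mu\gamma = \frac{a}{4}\cdot\frac{\mu}{L}$. □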

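Theorem 2. Consider a uniform q-memorization algorithm with $\alpha$-updates that are on average $\epsilon$-accurate
(i.e. $\mathbf{E}\|\alpha_i - \beta_i\|^2 \le \epsilon$). For any step size $\gamma \le \tilde\gamma(K)$, where $\tilde\gamma$ is given in Eq. (39) in
Corollary 5 below (note that $\tilde\gamma(K) \ge \frac{2}{3}\gamma^*(K)$ and $\tilde\gamma(K) \to \gamma^*(K)$ as $K \to 0$), we get

$$\bar{\mathbf{E}}\mathcal{L}(w^t, H^t) \le (1-\mu\gamma)^t \mathcal{L}_0 + \frac{4\gamma\epsilon}{\mu}, \quad \text{with } \mathcal{L}_0 := \|w^0 - w^*\|^2 + s(\gamma)\,\mathbf{E}\|f_i'(w^*)\|^2, \tag{38}$$

where $\bar{\mathbf{E}}$ denotes the (unconditional) expectation over histories (in contrast to $\mathbf{E}$, which is conditional),
and $s(\gamma) := \frac{4\gamma}{K\mu}(1 - 2L\gamma)$.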


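Proof. Following the same line of argument as in the Lemma and Theorem 1 with the modifications
summarized in Corollary 5, we obtain

$$\mathbf{E}\mathcal{L}(w^+, H^+) \le (1-\gamma\mu)\,\mathcal{L}(w, H) + 4\gamma^2\epsilon,$$

and unrolling the recurrence over $t$,

$$\bar{\mathbf{E}}\mathcal{L}(w^t, H^t) \le (1-\gamma\mu)^t \mathcal{L}(w^0, H^0) + \left(\sum_{s=0}^{t-1}(1-\gamma\mu)^s\right) 4\gamma^2\epsilon \le (1-\gamma\mu)^t \mathcal{L}(w^0, H^0) + \frac{4\gamma\epsilon}{\mu},$$

using $\sum_{s=0}^{t-1} x^s \le \frac{1}{1-x}$ applied with $x = (1-\rho)$, where $\rho = \gamma\mu$ (see JS] for its use for constant step size SGD).

According to Eq. (13), $\mathcal{L}(w^0, H^0) = \|w^0 - w^*\|^2 + S\sigma H^0$, where $S = \frac{\gamma n}{Lq} = \frac{4\gamma}{K\mu}$. As the algorithm
initializes $\alpha_i^0$ to 0, we have $H^0 = \mathbf{E}\|f_i'(w^*)\|^2$. Finally, the proof of Corollary 5 follows the proof
of Theorem 1 and also gives $\sigma^* = 1 - 2L\gamma$ as in Eq. (33). Substituting in $\mathcal{L}(w^0, H^0)$ gives $\mathcal{L}_0$. □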
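The unrolling step can be illustrated numerically. The sketch below iterates the worst case of a one-step bound of the form $\mathcal{L}^+ \le (1-\gamma\mu)\mathcal{L} + 4\gamma^2\epsilon$ and compares it with the closed form obtained by bounding the geometric sum by $1/(\gamma\mu)$. The function names and all constants are hypothetical and chosen only for illustration.

```python
def unrolled(L0, gamma, mu, eps, t):
    # Apply the one-step bound with equality t times (worst case).
    value = L0
    for _ in range(t):
        value = (1.0 - gamma * mu) * value + 4.0 * gamma ** 2 * eps
    return value

def closed_form(L0, gamma, mu, eps, t):
    # Geometric decay of the initial value plus the noise floor 4*gamma*eps/mu.
    return (1.0 - gamma * mu) ** t * L0 + 4.0 * gamma * eps / mu

# Hypothetical constants with 0 < gamma * mu < 1.
L0, gamma, mu, eps = 10.0, 0.05, 0.5, 1e-3
steps = (0, 1, 10, 100, 1000)
gaps = [closed_form(L0, gamma, mu, eps, t) - unrolled(L0, gamma, mu, eps, t) for t in steps]
print(gaps)
```

The gap is non-negative for every $t$ and shrinks as $t$ grows, since the partial geometric sum approaches $1/(\gamma\mu)$ from below.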


Corollary 4. With $\gamma = \min\{\mu, \tilde\gamma(K)\}$ we have

$$\frac{4\gamma\epsilon}{\mu} \le 4\epsilon, \quad \text{with a rate } \rho = \min\{\mu^2, \mu\tilde\gamma\}.$$


Proof. By definition, $\gamma \le \mu$, thus $\gamma/\mu \le 1$, yielding the first claim. By definition, we also have
$\gamma \le \tilde\gamma(K)$, thus from Theorem 1 adapted to Corollary 5 we know that $\rho = \mu\gamma$, which concludes
the proof. (Note that with $\gamma > \tilde\gamma(K)$ we will increase the error, while decreasing the rate to $\rho(\gamma) < \rho(\tilde\gamma) = \mu\tilde\gamma$.
This is why this choice is not sensible according to our theory.) □
Corollary 5 (Patch-ups). Using Eq. ([n) instead of Eq. but then setting = 0 yields the same
results as before with the following changes:

(a) Lemma: the bound becomes $\gamma \le \min\{\,\ldots\,\}$

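(b) Theorem 1: still using $\gamma = \frac{a}{4L}$, we require $a < \frac{2}{3}$ and we get a similar (slightly smaller)
expression for $\rho$ as well as for the optimal step size $\tilde\gamma(K) := \frac{\tilde a(K)}{4L}$ replacing $\gamma^*$ in the theorem:

$$\rho = \frac{q}{n}\,\frac{1-\frac{3}{2}a}{1-\frac{1}{2}a} \quad \text{and} \quad \tilde a(K) = \frac{2K}{1+\frac{3}{2}K+\sqrt{1+K+\left(\frac{3}{2}K\right)^2}} \ge \frac{2}{3}\,a^*(K). \tag{39}$$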


(c) The optimal asymptotic rate is still $\rho = \frac{q}{n}$.


Proof. The changes follow by redoing all proofs with an additional factor of 2 on the RHS of Eq. One can also
readily verify that the ratio $\frac{\tilde a(K)}{a^*(K)}$ (with $a^*(K)$ defined in Eq. (18)) is a decreasing function of $K$,
with value 1 for $K = 0$, and limiting value $\frac{2}{3}$ for $K \to \infty$. □
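The behaviour of this ratio is easy to inspect numerically. The following sketch assumes the closed forms $a^*(K) = 2K/(1+K+\sqrt{1+K^2})$ and $\tilde a(K) = 2K/(1+\frac{3}{2}K+\sqrt{1+K+(\frac{3}{2}K)^2})$ and evaluates them on a logarithmic grid; the function names are illustrative.

```python
import math

def a_star(K):
    return 2.0 * K / (1.0 + K + math.sqrt(1.0 + K * K))

def a_tilde(K):
    return 2.0 * K / (1.0 + 1.5 * K + math.sqrt(1.0 + K + (1.5 * K) ** 2))

# Logarithmic grid from K = 1e-6 to K = 1e6.
Ks = [10.0 ** (e / 4.0) for e in range(-24, 25)]
ratios = [a_tilde(K) / a_star(K) for K in Ks]
print(ratios[0], ratios[len(ratios) // 2], ratios[-1])
```

On this grid the ratio starts close to 1, decreases monotonically, and ends slightly above $\frac{2}{3}$.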


A.1 Implementation details

Construction of the neighborhoods required by q-SAGA and $\mathcal{N}$-SAGA: For each datapoint $i$,
we want to define a neighborhood $\mathcal{N}_i$ (defined as the set of children of $i$ in a directed graph) such
that for $j \in \mathcal{N}_i$, $\|\alpha_j - \cdots\| \le \epsilon$. The approximation bounds given earlier show that this distance
is a function of $w$, which is not known a priori. In order to address this issue, we used the distance
between data points $\delta_{ij} := \|x_i - x_j\|$ as a surrogate. The construction of the neighborhoods then
amounts to constructing a directed graph on $n$ nodes by setting the $q$ nearest points to $j$ as its
parents¹. This ensures that $|\{i : j \in \mathcal{N}_i\}| = q$ ($\forall j$), i.e. every $j$ has exactly $q$ parents. Note that this
simple construction can yield asymmetric neighborhoods (i.e. $j \in \mathcal{N}_i \not\Rightarrow i \in \mathcal{N}_j$). Here $\mathcal{N}_i$ is the set of
children of $i$, and does not have to be of size $q$. One could also construct a symmetric neighborhood
by defining $j$ to be a child of $i$ if their distance is less than $\sqrt{\epsilon}$ (which is a symmetric relationship),
where $\epsilon$ is a constant chosen such that $q \approx 20$. In practice, we did not find this construction to yield
better performance (in addition to violating the uniform q-memorization property). Note also that
the above constructions ensure that $i \in \mathcal{N}_i$.
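The nearest-parents construction can be sketched in a few lines of Python. This is a naive quadratic-time illustration under stated assumptions, not the implementation used in the experiments: the function name `build_neighborhoods`, the tie-breaking rule and the toy data are all hypothetical.

```python
import math

def build_neighborhoods(X, q):
    """Return children[i], the neighborhood of i, for the q-nearest-parents graph.

    X is a list of points (tuples of floats). Every point j picks as parents
    the q points nearest to it, itself included, and is added to the
    children set of each of those parents.
    """
    n = len(X)
    children = [set() for _ in range(n)]
    for j in range(n):
        # Sort candidates by distance to j; the second key puts j itself first
        # among zero-distance ties, so that j is always one of its own parents.
        order = sorted(range(n), key=lambda i: (math.dist(X[i], X[j]), i != j))
        for i in order[:q]:
            children[i].add(j)
    return children

# Toy one-dimensional data set with distinct pairwise distances.
X = [(0.0,), (1.0,), (3.0,), (10.0,)]
children = build_neighborhoods(X, q=2)
print(children)  # [{0, 1}, {0, 1, 2}, {2, 3}, {3}]
```

In this toy example every point has exactly two parents, while the children sets have sizes 2, 3, 2 and 1. Point 2 is a child of point 1 but not the other way around, which shows the asymmetry mentioned above.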


Growing n heuristic: For all the q-memorization algorithms, we used the same initialization
heuristic proposed in 0 |4l for which during the first pass, datapoints are introduced one-by-one,
with averages computed in terms of the number of datapoints processed so far (i.e. the normalization
for $\bar\alpha$ is the number of different points seen so far instead of $n$).
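A minimal sketch of this normalization, assuming scalar gradients for brevity and a hypothetical helper `first_pass_averages`, keeps a running sum of the stored gradients and divides by the number of distinct points seen so far:

```python
def first_pass_averages(stream):
    """Track the average of the stored gradients, normalized by points seen so far.

    stream yields (index, gradient) pairs; gradients are scalars here for brevity.
    """
    memory = {}   # index -> most recently stored gradient
    total = 0.0   # running sum of all stored gradients
    averages = []
    for i, grad in stream:
        total += grad - memory.get(i, 0.0)   # replace the old entry, if any
        memory[i] = grad
        averages.append(total / len(memory)) # divide by distinct points seen, not n
    return averages

averages = first_pass_averages([(0, 2.0), (1, 4.0), (0, 6.0)])
print(averages)  # [2.0, 3.0, 5.0]
```

Revisiting index 0 in the third step replaces its stored value without changing the normalization count.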


¹This can be naively implemented by computing all pairwise distances $\delta_{ij}$ between data points ($O(n^2)$), but
more efficient data structures using hashing [1] or randomized partition trees [3] can be used.